Towards Label-free Scene Understanding by Vision Foundation Models
Vision foundation models such as Contrastive Vision-Language Pre-training
(CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot
performance on image classification and segmentation tasks. However, the
incorporation of CLIP and SAM for label-free scene understanding has yet to be
explored. In this paper, we investigate the potential of vision foundation
models in enabling networks to comprehend 2D and 3D worlds without labelled
data. The primary challenge lies in effectively supervising networks under
extremely noisy pseudo labels, which are generated by CLIP and further
exacerbated during the propagation from the 2D to the 3D domain. To tackle
these challenges, we propose a novel Cross-modality Noisy Supervision (CNS)
method that leverages the strengths of CLIP and SAM to supervise 2D and 3D
networks simultaneously. In particular, we introduce a prediction consistency
regularization to co-train the 2D and 3D networks, and then further enforce
latent-space consistency between the networks using SAM's robust feature
representations. Experiments conducted on diverse indoor and outdoor datasets
demonstrate the superior performance of our method in understanding 2D and 3D
open environments. Our 2D and 3D networks achieve label-free semantic
segmentation with 28.4% and 33.5% mIoU on ScanNet, improvements of 4.7% and
7.9%, respectively. On the nuScenes dataset, our method reaches 26.8% mIoU, an
improvement of 6%. Code will be released at
https://github.com/runnanchen/Label-Free-Scene-Understanding
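The abstract names two supervision terms: a prediction-consistency regularization between the 2D and 3D networks, and a latent-space consistency anchored on SAM features. The paper's exact loss formulations are not given here, so the following NumPy sketch is only an assumed illustration: it uses a symmetric KL divergence for the prediction term and cosine distances for the latent term, and all function names are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over class logits."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prediction_consistency(logits_2d, logits_3d):
    """Symmetric KL divergence between the 2D and 3D networks' class
    predictions for the same points (an assumed form of the co-training
    regularizer, not necessarily the paper's)."""
    p, q = softmax(logits_2d), softmax(logits_3d)
    eps = 1e-8
    kl_pq = np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)
    kl_qp = np.sum(q * np.log((q + eps) / (p + eps)), axis=-1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))

def latent_consistency(feat_2d, feat_3d, feat_sam):
    """Pull both networks' latent features toward SAM's feature
    representation via cosine distance (again an assumed form)."""
    def cos_dist(a, b):
        a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
        return 1.0 - np.sum(a * b, axis=-1)
    return float(np.mean(cos_dist(feat_2d, feat_sam)
                         + cos_dist(feat_3d, feat_sam)))
```

Both terms vanish when the two networks (and SAM) agree, so minimizing their sum drives the 2D and 3D branches toward a shared, SAM-regularized prediction.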
Vid2Curve: Simultaneous Camera Motion Estimation and Thin Structure Reconstruction from an RGB Video
Thin structures, such as wire-frame sculptures, fences, cables, power lines,
and tree branches, are common in the real world. It is extremely challenging to
acquire their 3D digital models using traditional image-based or depth-based
reconstruction methods because thin structures often lack distinct point
features and have severe self-occlusion. We propose the first approach that
simultaneously estimates camera motion and reconstructs the geometry of complex
3D thin structures in high quality from a color video captured by a handheld
camera. Specifically, we present a new curve-based approach to estimate
accurate camera poses by establishing correspondences between featureless thin
objects in the foreground across consecutive video frames, without requiring
visual texture in the background scene to lock onto. Enabled by this effective
curve-based camera pose estimation strategy, we develop an iterative
optimization method with measures tailored to geometry, topology, and
self-occlusion for reconstructing 3D thin structures. Extensive
validations on a variety of thin structures show that our method achieves
accurate camera pose estimation and faithful reconstruction of 3D thin
structures with complex shape and topology at a level that has not been
attained by other existing reconstruction methods.
Comment: Accepted by SIGGRAPH 202
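The pose-estimation idea above matches featureless curve points between frames and then solves for camera motion. As a loose toy analogue of that match-then-solve loop (not the paper's algorithm, which estimates full 6-DoF camera poses from 3D curves), here is a 2D curve registration sketch using nearest-point correspondences and a closed-form rigid fit; the function name and setup are assumptions for illustration only.

```python
import numpy as np

def align_curves_2d(src, dst, iters=30):
    """Toy match-then-solve registration of two 2D curves.

    Each iteration (a) matches every source sample to its nearest point
    on the target curve -- geometry alone drives the matching, since the
    points carry no visual features -- and (b) solves the best rigid
    transform in closed form (Kabsch/Procrustes). Returns (R, t) such
    that src @ R.T + t approximates dst.
    """
    R, t = np.eye(2), np.zeros(2)
    for _ in range(iters):
        cur = src @ R.T + t
        # (a) nearest-neighbour correspondences
        d = np.linalg.norm(cur[:, None, :] - dst[None, :, :], axis=-1)
        matched = dst[np.argmin(d, axis=1)]
        # (b) closed-form rigid fit between centered point sets
        mu_c, mu_m = cur.mean(axis=0), matched.mean(axis=0)
        H = (cur - mu_c).T @ (matched - mu_m)
        U, _, Vt = np.linalg.svd(H)
        S = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R_step = Vt.T @ S @ U.T          # reflection-safe rotation
        t_step = mu_m - mu_c @ R_step.T
        # compose the incremental transform with the running estimate
        R = R_step @ R
        t = R_step @ t + t_step
    return R, t
```

As in the paper's setting, the matching step needs no background texture; iterating the two steps refines correspondences and pose together, which is the core intuition behind the curve-based strategy.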